Abstract

Wind energy plays an increasing role in the supply of energy world-wide. The energy
output of a wind farm is highly dependent on the weather conditions present at the wind
farm. If the output can be predicted more accurately, energy suppliers can coordinate the
collaborative production of different energy sources more efficiently to avoid costly
overproduction.

In this paper, we take a computer science perspective on energy prediction based
on weather data and analyze the important parameters as well as their correlation with the
energy output. To deal with the interaction of the different parameters, we use symbolic
regression based on the genetic programming tool DataModeler.

Our studies are carried out on publicly available weather and energy data for a wind farm
in Australia. We reveal the correlation of the different variables with the energy output. The
model obtained for energy prediction gives a very reliable prediction of the energy output
for newly given weather data.

1 Introduction 

Renewable energy such as wind and solar energy plays an increasing role in the supply of energy
world-wide. This trend will continue because the global energy demand is increasing and the
use of nuclear power and traditional sources of energy such as coal and oil is either considered
unsafe or leads to a large amount of CO2 emissions.

Wind energy is a key player in the field of renewable energy. The capacity of wind energy
production has increased drastically in recent years. In Europe, for example, the capacity
of wind energy production has doubled since 2005. However, the production of wind energy is
hard to predict as it relies on the rather unstable weather conditions present at the wind farm.
In particular, the wind speed is crucial for energy production based on wind, and it
may vary drastically over different periods of time. Energy suppliers are interested in accurate
predictions, as they can avoid overproduction by coordinating the collaborative production of
traditional power plants and weather-dependent energy sources.

Our aim is to map weather data to energy production. We want to show that even data that 
is publicly available for weather stations close to wind farms can be used to give a good prediction 
of the energy output. Furthermore, we examine the impact of different weather conditions 
on the energy output of wind farms. We are, in particular, interested in the correlation of 
different components that characterize the weather conditions such as wind speed, pressure, 
and temperature.

A good overview of the different methods that were recently applied in forecasting wind
power generation can be found in [2]. Statistical approaches use historical data to predict the
wind speed on an hourly basis or to predict energy output directly. On the other hand, short-term
prediction is often done based on meteorological data, and learning approaches are applied.
Kusiak, Zheng, and Song [8] have shown how wind speed data may be used to predict the power
output of a wind farm based on time series prediction modeling. Neural networks are a very
popular learning approach for wind power forecasting based on given time series. They provide
an implicit model of the function that maps the given weather data to an energy output.

Jursa and Rohrig [3] have used particle swarm optimization and differential evolution to
minimize the prediction error of neural networks for short-term wind power forecasting. Kramer
and Gieseke [7] used support vector regression for short-term energy forecasting, and kernel methods
and neural networks to analyze wind energy time series [6]. These studies are all based on wind
data and do not take other weather conditions into account. Furthermore, neural networks 
have the disadvantage that they give an implicit model of the function predicting the output, 
and these models are rarely accessible to a human expert. Usually, one is also interested in the 
function itself and the impact of the different variables that determine the output. We want to 
study the impact of different variables on the energy output of the wind farm. Surely, the wind 
speed available at the wind farm is a crucial parameter. Other parameters that influence the 
energy output are for example air pressure, temperature and humidity. Our goal is to study 
the impact and correlation of these parameters with respect to the energy output. 

Genetic programming (GP) (see [9] for a detailed presentation) is a type of evolutionary
algorithm that can be used to search for functions that map input data to output data. It has
been widely used in the field of symbolic regression, and the goal of this paper is to show how it
can be used for the important real-world problem of predicting energy outputs of wind farms
from weather data. The advantage of this method is that it comes up with an explicit expression
mapping weather data to energy output. This expression can be further analyzed to study the
impact of the different variables that determine the output. To compute such an expression,
we use the tool DataModeler [1], which is the state-of-the-art tool for symbolic regression
based on genetic programming. We also use DataModeler to carry out a sensitivity analysis,
which studies the correlation between the different variables and their impact on the accuracy
of the prediction.

We proceed as follows. In Section 2, we give a basic introduction to the field of genetic
programming and symbolic regression and describe DataModeler. Section 3 describes our
approach to predicting energy output based on weather data, and in Section 4 we report on
our experimental results. Finally, we finish with some concluding remarks and topics for future 
research. 

2 Genetic Programming and DataModeler 

Genetic programming [5] is a type of evolutionary algorithm that is used in the field of machine
learning. Motivated by the evolution process observed in nature, computer programs are evolved
to solve a given task. Such programs are usually encoded as syntax expression trees. Starting
with a given set of trees called the population, new trees called the offspring population are
created by applying variation operators such as crossover and mutation. Finally, a new parent
population is selected out of the previous parent and offspring populations based on how well
these trees perform on the given task.

Genetic programming has its main success stories in the field of symbolic regression. Given
a set of input-output vectors, the task is to find a function that maps the input to the output
as well as possible, while avoiding overfitting. The resulting function is later often used to
predict the output for a newly given input. Syntax trees represent functions in this case, and
the functions are changed by crossover and mutation to produce new functions. The quality of
a syntax tree is determined by how well it maps the given set of inputs to their corresponding
outputs.

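The task in symbolic regression can be stated as follows. Given a set of data vectors
$(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i) \in \mathbb{R}^{k+1}$, $1 \le i \le n$, find a function
$f : \mathbb{R}^k \to \mathbb{R}$ such that the approximation error, e.g. the root mean square error

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2}$$

with $x_i = (x_{1i}, x_{2i}, \ldots, x_{ki})$ is minimized.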
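As an illustration of the objective just stated, the following minimal Python sketch computes the root mean square error of a candidate function over a set of data vectors. The toy data and the candidate function are hypothetical and serve only to make the definition concrete; they are not taken from the wind farm data.

```python
import math

def rmse(f, xs, ys):
    """Root mean square error of model f over input vectors xs and targets ys."""
    n = len(xs)
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n)

# Hypothetical toy data: each input vector has k = 2 components.
xs = [(1.0, 2.0), (2.0, 0.0), (3.0, 1.0)]
ys = [3.0, 2.0, 5.0]

# A candidate function f: R^2 -> R, as symbolic regression might propose.
candidate = lambda x: x[0] + x[1]

error = rmse(candidate, xs, ys)
```

Here the candidate reproduces the first two targets exactly and misses the third by one, so the error is the square root of 1/3.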

We want to use a tool called DataModeler for our investigations. It is based on genetic 
programming and designed for solving symbolic regression problems. 



2.1 DataModeler 

Evolved Analytics' DataModeler is a complete data analysis and feature selection environment 
running under Wolfram Mathematica 8. It offers a platform for data exploration, data-driven
model building, model analysis and management, response exploration and variable sensitivity 
analysis, model-based outlier detection, data balancing and weighting. 

Data-driven modeling in DataModeler happens by symbolic regression via genetic programming.
The SymbolicRegression function offers several evolutionary strategies which differ in
the applied selection schemes, elitism, reproduction strategies, and fitness evaluation strategies.
An advanced user can take full control over symbolic regression and introduce new function
primitives, new fitness functions, selection and propagation schemes, etc. by specifying appropriate
options in the function call. We, however, used the default settings and the default evolution
strategy, which is called ClassicGP in DataModeler.¹

In the symbolic regression performed here, a population of individuals (syntax trees)
evolves over a variable number of generations at the Pareto front in the three-dimensional
objective space of model complexity, model error, and model age [3], [10].

Model error in the default setting ranges between 0 and 1, with the best value being 0. It is
computed as $1 - R^2$, where $R$ is a scaled correlation coefficient. The correlation coefficient of the
predicted output is scaled to have the same mean and standard deviation as the observed output.
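To make the error measure concrete, here is a minimal Python sketch that computes $1 - R^2$ with $R$ taken to be the Pearson correlation between predicted and observed values. This is an assumption about the measure, not DataModeler's actual implementation. Because the Pearson correlation is invariant under shifting and positive scaling of the predictions, no explicit rescaling step is needed in this sketch.

```python
import math

def model_error(predicted, observed):
    """Return 1 - R^2, with R the Pearson correlation of predictions and observations."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        return 1.0  # a constant series carries no correlation information
    r = cov / math.sqrt(var_p * var_o)
    return 1.0 - r * r

# Hypothetical values: a prediction with the right trend but a wrong scale
# still yields an error of (approximately) zero.
scaled_error = model_error([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
partial_error = model_error([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

In the second example the correlation is 0.8, so the error is 0.36.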

The model complexity is the expressional complexity of a model, and it is computed as the
total sum of nodes in all subtrees of the given GP tree. The model age is computed as the
number of generations that the model survived in the population. The age of a child individual
is computed by incrementing the age of the parent contributing to the root node of the child.
We use the age as a secondary optimization objective, as it is used only internally for evolution.
At the end of symbolic regression runs, results are displayed in the two-objective space of
user-selected objectives; in our case these objectives are model expressional complexity and $1 - R^2$.
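The complexity measure described above can be sketched in a few lines of Python. The nested-tuple encoding of a tree (operator name first, children after) is a hypothetical representation chosen for this illustration, not DataModeler's internal format. The example tree corresponds to the expression -25.2334 + 3.21666 * windGust2 reported in Table 1, for which the paper states a complexity of 11.

```python
def node_count(tree):
    """Number of nodes in a tree given as (operator, child, child, ...) or a leaf."""
    if isinstance(tree, tuple):
        return 1 + sum(node_count(child) for child in tree[1:])
    return 1  # a constant or a variable name is a single leaf node

def expressional_complexity(tree):
    """Sum of the node counts of all subtrees, including the tree itself."""
    total = node_count(tree)
    if isinstance(tree, tuple):
        total += sum(expressional_complexity(child) for child in tree[1:])
    return total

# -25.2334 + 3.21666 * windGust2 as a Plus node over a constant and a Times node.
model = ("Plus", -25.2334, ("Times", 3.21666, "windGust2"))
complexity = expressional_complexity(model)
```

The full tree has 5 nodes, the Times subtree has 3, and each of the three leaves counts 1, giving 5 + 3 + 1 + 1 + 1 = 11.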

The default population size is 300. The default elite set size is 50 individuals from the 'old'
population closest to the 3-dimensional Pareto front in the objective space. These individuals
are copied to the 'new' population of 300 individuals, after which the size of the new population
is decreased down to the necessary 300. This is done by selecting models from Pareto layers until
the specified amount is found.

The selection of individuals for propagation happens by means of Pareto tournaments. By
default, 30 models are randomly sampled from the current population, and Pareto optimal 

1 All models reported in this paper were generated using two calls of SymbolicRegression with only the following 
arguments: input matrix, response vector, execution time, number of independent evolutions, an option to archive 
models with a certain prefix-name, and a template specification. 
Table 1: Internal regression model representation in DataModeler for the model with the
expression $-25.2334 + 3.21666 \cdot \text{windGust2}$ (see also Figure 1):

GPModel[{11, 0.300409}, Σ[-25.2334, Π[3.21666, windGust2]],
  {ModelAge -> 1,
   ModelingObjective -> ({ModelComplexity[#1], 1 - AbsoluteCorrelation[#2, #3]^2} &),
   ModelingObjectiveNames -> {Complexity, 1-R^2},
   DataVariables -> {year, month, day, hour, minute, temperature, apparentTemperature, dewPoint,
     relativeHumidity, wetBulbDepression, windSpeed, windGust, windSpeed2, windGust2,
     pressureQNH, rainSince9am},
   DataVariableRange -> {{2010, 2011}, {1, 12}, {1, 31}, {0, 23}, {0, 30}, {4.2, 23.4},
     {-14.2, 24.}, {-3.2, 19.1}, {40, 100}, {0., 6.6}, {0, 106}, {0, 130}, {0, 57}, {0, 70},
     {987.8, 1037.5}, {0., 50.4}},
   RangeExpansion -> None,
   ModelingVariables -> {year, month, day, hour, minute, temperature, apparentTemperature, dewPoint,
     relativeHumidity, wetBulbDepression, windSpeed, windGust, windSpeed2, windGust2,
     pressureQNH, rainSince9am},
   FunctionPatterns -> {Σ[_, _], Π[_, __], D[_, _], S[_, _], P2[_], SQ[_], IV[_], M[_]},
   StoreModelSet -> True, ProjectName -> fullDataAllVars, TemplateTopLevel -> {Σ[_, _]},
   TimeConstraint -> 2000, IndependentEvolutions -> 10}].




Figure 1: Model tree plot of the individual from Table 1. Model complexity is the sum of nodes
in all subtrees of the given tree (11). Model error is computed as $1 - R^2 = 0.30$.

individuals from this sample are determined as winners to undergo variation until a necessary 
number of new individuals is created. 
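The Pareto tournament described above can be sketched as follows in Python. This is an illustration under stated assumptions, not DataModeler's implementation: models are hypothetical dictionaries carrying a tuple of objectives to be minimized (for example complexity, error, and age), and the function names are invented for this example.

```python
import random

def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(models):
    """Return the models whose objective vectors are not dominated by any other model."""
    return [m for m in models
            if not any(dominates(o["objectives"], m["objectives"]) for o in models)]

def pareto_tournament(population, tournament_size=30, rng=random):
    """Sample a tournament at random and return its Pareto-optimal members as winners."""
    sample = rng.sample(population, min(tournament_size, len(population)))
    return pareto_front(sample)

# Hypothetical population with (complexity, error) objectives.
population = [
    {"name": "A", "objectives": (11, 0.30)},
    {"name": "B", "objectives": (24, 0.299)},
    {"name": "C", "objectives": (30, 0.35)},   # dominated by A
]
winners = pareto_tournament(population)
```

With only three models the tournament covers the whole population, so the winners are A and B; C loses because A is both simpler and more accurate.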

Models are coded as parse trees using the GPModel structure, which contains placeholders
for information about model quality, data variables and ranges used to develop the model, and
some settings of symbolic regression. For example, the internal GPModel representation of the
first Pareto front model from the set of models in Figure 3, with the expression
$-25.2334 + 3.21666 \cdot \text{windGust2}$, is presented in Table 1. Note that the first vector inside the GPModel
structure represents model quality. Model complexity is 11, and model error is 0.300409. The parse
tree of the same model is plotted in Figure 1.

When the specified execution time limit of a run (in seconds) is reached, the independent
evolution run terminates, and the vector of model objectives in the final population is re-evaluated to
contain only model complexity and model error. The set of models can be analyzed further for
variable drivers, most frequent variable combinations, behavior of the response, consistency in
prediction, accuracy vs. complexity trade-offs, etc.

When the goal is the prediction of the output in an unobserved region of the data space, it
is essential to use model ensembles rather than individual models. Because of
built-in niching, complexity control, and independent evolutions used in DataModeler's symbolic
regression, the final models are developed to be diverse (with respect to structural complexity,
model forms, and residuals), but they are all global models, built to predict the training response in
the entire training region. Due to their diversity and high quality, rich sets of final models allow
us to select multiple individuals into model ensembles. The prediction of a set of individuals is then
computed as a median or a median average of the individual predictions of ensemble members,
while disagreement in the predictions (standard deviation in this paper) is used to specify the
confidence interval of the prediction. When models are extrapolated, the confidence of predictions
naturally deteriorates and confidence intervals become wider. This first allows a more robust
prediction of the response (since over-fitting is further mitigated by choosing models of different
accuracy and complexity into an ensemble), and second, it makes the predictions more
trustworthy, since predictions are also supplied with confidence intervals.
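The ensemble prediction described above can be illustrated with a short Python sketch: the prediction is the median of the member predictions, and the disagreement is their standard deviation. Whether a population or a sample standard deviation is meant is not stated in the text; the population version is assumed here. The three member models are hypothetical toy functions, not models from this study.

```python
import statistics

def ensemble_predict(models, x):
    """Return (median prediction, standard deviation of member predictions) for input x."""
    predictions = [model(x) for model in models]
    # Assumption: population standard deviation as the disagreement measure.
    return statistics.median(predictions), statistics.pstdev(predictions)

# Hypothetical ensemble members that agree near x = 0 and diverge away from it.
members = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]

near_prediction, near_spread = ensemble_predict(members, 0.0)
far_prediction, far_spread = ensemble_predict(members, 2.0)
```

At x = 0 all members agree and the spread is zero; at x = 2 the predictions are 2, 4 and 6, so the median is 4 and the spread grows, mirroring how confidence intervals widen under extrapolation.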

To select ensembles we used a built-in function of DataModeler that focuses on the most typical
individuals of the model set as well as on individuals that have the least correlated residuals. Because
of space constraints we refer the reader to [1] for further information.



3 Our Approach 

The main goal of this paper is to use public data to check the feasibility of wind energy prediction
using an industrial-strength, off-the-shelf non-linear modeling and feature selection tool. In
our study, we investigate and predict the energy production of the wind farm Woolnorth in
Tasmania, Australia, based on publicly available data. The energy production data is made
publicly available by the Australian Energy Market Operator (AEMO) in real time to assist
in maintaining the security of the power system.² For the creation of our models and the
prediction, we associate the wind farm with the Australian weather station ID091245, located
at Cape Grim, Tasmania. Its data is available for free for a running observation time window
of 72 hours.³



3.1 Data 

We collected both the weather and energy production data for the time window September 2010 
till July 2011. The output of the farm is available with a rate of one measurement every five 
minutes, and the weather data with a rate of one measurement every 30 minutes. 

The wind farm's production capacity is split into two sites, which complicated the generation
of models. The site "Studland Bay" has a maximum output of 75 MW, and "Bluff Point" has a
maximum output of 65 MW and is located 50 km south of the first site. The weather station is
located on the first site. For wind coming from the west (which is the prevailing wind direction), the
difference in location is negligible. But if wind comes from the north, there will be an energy and
wind increase right away, plus another energy increase 1-2 hours later (the time delay depends
on the actual wind speed). Similarly, if wind comes from the south, there will be an increase in the
energy production (although no wind is indicated by the weather station) and then, 1-2 hours
later, an energy increase accompanied by a measured wind speed increase.



3.2 Data pre-processing 

To perform data modeling and variable selection on the collected data, we had to perform data
pre-processing to create a table of weather and energy measurements taken at the same time
intervals. Energy output of the farm is measured every 5 minutes, including at the time
stamps of 0 and 30 minutes of every hour when the weather is measured. Our approach was
to correlate weather measurements with the average energy output of the farm reported
in the [0, 25] and [30, 55] minute intervals of every hour. Such averaging makes modeling more
difficult, but uses all energy information available.
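The averaging step can be sketched in Python as follows. The authors used Mathematica, so this is only an illustration of the idea under stated assumptions: readings are hypothetical (timestamp, value) pairs, each 5-minute energy reading is assigned to the half-hour window that contains it, and weather rows without a matching energy window are dropped.

```python
from collections import defaultdict
from datetime import datetime

def half_hour_averages(energy_readings):
    """Map each half-hour start time to the mean of the energy readings in that window."""
    buckets = defaultdict(list)
    for timestamp, megawatts in energy_readings:
        # Minutes 0-25 belong to the window starting at :00, minutes 30-55 to :30.
        start = timestamp.replace(minute=0 if timestamp.minute < 30 else 30,
                                  second=0, microsecond=0)
        buckets[start].append(megawatts)
    return {start: sum(values) / len(values) for start, values in buckets.items()}

def join_weather_energy(weather_rows, energy_readings):
    """Pair each weather measurement with the averaged energy of its half-hour window."""
    averages = half_hour_averages(energy_readings)
    return [(t, weather, averages[t]) for t, weather in weather_rows if t in averages]

# Hypothetical readings: six 5-minute values in [10:00, 10:25] and one at 10:30.
energy = [(datetime(2011, 7, 1, 10, m), float(v))
          for m, v in zip(range(0, 30, 5), (10, 20, 30, 40, 50, 60))]
energy.append((datetime(2011, 7, 1, 10, 30), 100.0))
weather = [(datetime(2011, 7, 1, 10, 0), {"windGust2": 20}),
           (datetime(2011, 7, 1, 11, 0), {"windGust2": 25})]
table = join_weather_energy(weather, energy)
```

The 11:00 weather row has no energy readings in this toy data and is therefore left out, which mirrors how missing time stamps reduce the number of common measurements.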



2 Australian Landscape Guardians: AEMO Non-Scheduled Generation Data: www.landscapeguardians.org.au/data/aemo/
(last visited August 31st, 2011)

3 Australian Government, Bureau of Meteorology: weather observations for Cape Grim:
www.bom.gov.au/products/IDT60801/IDT60801.94954.shtml (last visited August 31st, 2011)
Different time scales used in the weather and energy data were automatically converted to
one scale using the DateList function of Wolfram Mathematica 8, which is the scientific computing
environment in which DataModeler operates.

Because of many missing, erroneous, and duplicate time stamps in the weather data we 
obtained 11022 common measurements of weather and averaged energy produced by the farm 
from October 2010 till June 2011. These samples were used as training data to build regression 
models. From 18 variables of the weather data at Cape Grim we excluded two variables prior 
to modeling: the Pressure MSL variable had more than 75% missing values and the Wind 
Direction variable was non-numeric. 

As test data we used 1408 common half-hour measurements of weather and averaged energy 
in July 2011. 

3.3 Data Analysis and Model Development 

As soon as weather and energy data from different sources were put in an appropriate input- 
output form, we were able to apply a standard data-driven modeling approach to them. 

A good approach employs iterations between three stages: Data Collection/Reduction,
Model Development, and Model Analysis and Variable Selection. In hard problems many iterations
are required to identify a subspace of minimal dimensionality where models of appropriate
accuracy and complexity trade-offs can be built.

Our problem is challenging for several reasons. First, it is hard to predict the total wind
energy output of the farm in the half hour following the moment when weather is measured,
especially when the weather station is several kilometers away from the farm. Second, public
data does not offer any information about the wind farm except for wind energy output. Third,
our training data covers the range of weather conditions observed only between October 2010
and June 2011, while the test data contains data from July, implying that our models must
have good generalization capabilities as they will be extrapolated to unseen regions of the
data space. And lastly, our challenging goal is to use all 16 publicly available numeric weather
characteristics for energy output prediction, while many of them are heavily correlated (see
Figure 2).

Multicollinearity in hard high-dimensional problems is a major hurdle for most regression
methods. Symbolic regression via GP is one of the very few methods which does not suffer from
multicollinearity and is capable of naturally selecting variables from the correlated subset for
final regression models.

Because ensemble-based symbolic regression and a robust variable selection methodology are
implemented in DataModeler, we settled for standard model development and variable selection
procedures using default settings.

The modeling goals of this study are: 

1. to identify the minimal subset of driving weather features that are significantly related to 
the wind energy output of the wind farm, 

2. to let genetic programming express these relationships in the form of explicit input-output 
regression models, and 

3. to select model ensembles for improved generalization capabilities of energy predictions 
and to analyze the quality of produced model ensembles using an unseen test set. 

Our approach is to achieve these goals using two iterations of symbolic regression modeling. 
At the first exploratory stage we run symbolic regression on training data to identify driving 
weather characteristics significantly related to the energy output. At the second modeling stage 



Figure 2: Correlation matrix plot of the data columns. Data variables are heavily correlated
(blue: positively, red: negatively).

we reduce the training data to the set of selected inputs and run symbolic regression to obtain 
models, and model ensembles for predicting energy output. 

4 Experimental Results 

4.1 Experimental setup 

The setup of symbolic regression used default settings of DataModeler except for the number of 
independent runs, execution time of each run and the template operator at the root of the GP 
trees. We executed 10 independent evolutionary runs of 2000 seconds in both stages. The root
node of all GP trees was fixed to a Plus. The primitives for regression models consisted of an
extended set of arithmetic operators: {Plus, Minus, Subtract, Divide, Times, Sqrt, Square, Inverse}.
The maximum arity of the Plus and Times operators is limited to 5.

Model trees have terminals labelled as variables or constants (random integers or reals), 
with a maximum allowed model complexity of 1000. Population size is 300, elite set size is 
50. Population individuals are selected for reproduction using Pareto tournaments with the 
tournament size of 30. Propagation operators are crossover (at rate 0.9), subtree mutation 
(rate 0.05), and depth preserving subtree mutation (rate 0.05). At the end of each independent 
evolution the population and archive individuals are merged together to produce a final set 
of models. At each stage of experiments the results of all independent evolutions are merged 
together to produce a super set of solutions (see an example in Figure 3).

For model analysis we applied additional model selection strategies to these super sets of 
models. We describe the additional model selection strategies, discovered variable drivers, final 
models, and the quality of predictions in the next section. 

4.2 Feature selection 

The initial set of experiments targets the feature selection, using all 16 input variables and all 
training data from October 2010 till June 2011. In the allowed 2000 seconds each symbolic 
regression run completed at most 217 generations. 

The 10 independent evolutions generated a super set of 4450 models. We reduced this set 
to robust models only, by applying interval arithmetic to remove models with potential for 
pathologies and unbounded response in the training data range. This generated 2559 unique 
Figure 3: A super set of models generated in the first stage of experiments with 10 independent
evolutions using all inputs. Red dots are Pareto front models, which are non-dominated
trade-offs in the space of model complexity and model error.

Figure 4: Selected set M1 of 'best' models in all variables and two modeling objectives.



robust models, and from those we selected the final set M1. This set contained 587 individuals
with a model error not exceeding 0.30 and a model complexity not exceeding 350, which lie
closest to the Pareto front in the Model Complexity versus Model Error objective space. The set
M1 is depicted in Figure 4 with Pareto front individuals indicated in red. The limit of 350 on
Model Complexity preserved the best-of-the-run model (the right-most red dot), but excluded
dominated individuals with model complexities up to 600.
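The robustness filter based on interval arithmetic can be illustrated with a simplified Python sketch. It is an assumption-laden illustration rather than DataModeler's procedure: trees use a hypothetical nested-tuple encoding, only four operators are handled, and a model is rejected when a denominator interval contains zero or a square root argument can become negative within the variable ranges. The ranges for windGust2 and dewPoint are those listed in Table 1.

```python
import math

def interval_eval(tree, ranges):
    """Bounds (lo, hi) of an expression over the variable ranges, or None if unsafe."""
    if isinstance(tree, (int, float)):
        return (float(tree), float(tree))
    if isinstance(tree, str):
        return ranges[tree]
    op, *args = tree
    bounds = [interval_eval(arg, ranges) for arg in args]
    if any(b is None for b in bounds):
        return None
    if op == "Plus":
        return (sum(b[0] for b in bounds), sum(b[1] for b in bounds))
    if op == "Times":
        lo, hi = bounds[0]
        for b_lo, b_hi in bounds[1:]:
            products = [lo * b_lo, lo * b_hi, hi * b_lo, hi * b_hi]
            lo, hi = min(products), max(products)
        return (lo, hi)
    if op == "Divide":
        (n_lo, n_hi), (d_lo, d_hi) = bounds
        if d_lo <= 0.0 <= d_hi:
            return None  # the denominator may reach zero: unbounded response
        quotients = [n_lo / d_lo, n_lo / d_hi, n_hi / d_lo, n_hi / d_hi]
        return (min(quotients), max(quotients))
    if op == "Sqrt":
        lo, hi = bounds[0]
        if lo < 0.0:
            return None  # the argument may become negative: undefined response
        return (math.sqrt(lo), math.sqrt(hi))
    raise ValueError("unsupported operator: " + op)

def is_robust(tree, ranges):
    """True if the model has a bounded, well-defined response over the given ranges."""
    return interval_eval(tree, ranges) is not None

ranges = {"windGust2": (0.0, 70.0), "dewPoint": (-3.2, 19.1)}
linear = ("Plus", -25.2334, ("Times", 3.21666, "windGust2"))
risky = ("Divide", 1.0, "dewPoint")   # hypothetical model dividing by dewPoint
```

The linear model from Table 1 is bounded over the training ranges, while the hypothetical division by dewPoint is rejected because dewPoint crosses zero. Interval arithmetic is conservative, so a filter of this kind may also discard some models that are in fact well behaved.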

We used the set M1 to perform variable presence and variable contribution analysis to identify
the variable drivers significantly related to energy output. The presence of input variables
in models from M1 is visualized in Figure 5 and Figure 6. We can observe from Figure 6 that
the six most frequently used variables are (in order of decreasing importance) windGust2,
windGust, dewPoint, month, relativeHumidity, and pressureQNH. While we observe that these
variables are most frequently used in a good set of candidate solutions in M1, it is somewhat
hard to define a threshold on these presence-based variable importances to select variable
drivers. For example, it is unclear whether we should select the top three, top four, or top five
inputs.
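A presence-based importance of this kind can be sketched in Python as the fraction of models in which each variable occurs. The nested-tuple tree encoding and the four toy models are hypothetical; only the first corresponds to an expression reported in the paper.

```python
from collections import Counter

def variables_in(tree):
    """Collect the variable names (string leaves) used by an expression tree."""
    if isinstance(tree, str):
        return {tree}
    if isinstance(tree, tuple):
        found = set()
        for child in tree[1:]:   # tree[0] is the operator name, not a variable
            found |= variables_in(child)
        return found
    return set()  # numeric constant

def variable_presence(models):
    """Fraction of models in which each variable appears at least once."""
    counts = Counter()
    for tree in models:
        counts.update(variables_in(tree))
    return {name: counts[name] / len(models) for name in counts}

models = [
    ("Plus", -25.2334, ("Times", 3.21666, "windGust2")),
    ("Plus", 1.0, ("Times", "windGust2", "dewPoint")),
    ("Plus", 2.0, ("Times", 0.5, "windGust")),
    ("Plus", 3.0, ("Times", "windGust2", "windGust2")),
]
presence = variable_presence(models)
ranking = sorted(presence, key=presence.get, reverse=True)
```

A variable used twice in the same model is counted once, so windGust2 has a presence of 0.75 here. As the text notes, such a ranking alone does not say where to cut off the list of drivers.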

For a crisper feature selection analysis, we performed a variable contribution analysis using
DataModeler to see how much each variable contributes to the relative error of the
models in which it is present. The median variable contributions computed using the model set
M1 are depicted in Figure 7. The plot clearly demonstrates that the contribution of all
variables other than the top three mentioned above is negligible, as was identified using variable presence
Figure 5: Presence of input variables in the selected set M1.

Figure 6: Presence of input variables in the selected set of models.

analysis. 

Results of the first stage of experiments suggest that the weather inputs windGust2,
windGust, and dewPoint 1) are the most frequently present in M1, 2) have the highest
contribution to the relative errors of models in M1, and are sufficient to achieve the accuracy of M1.
In other words, these inputs are sufficient to predict energy output with an $R^2$ between 70%
and 80% on the training data.

The high correlation between the windGust and windGust2 variables motivated us to select only
one of them for the second round of modeling, together with dewPoint, to generate prediction
models. Symbolic regression does not guarantee that only one particular input variable out of
a set of correlated inputs will be present in the final models. It might be that only one of
the two is sufficient to predict the response with the same accuracy, or that both are necessary for
success. Our choice was to select windGust2 (as the most frequent variable in the models)
together with dewPoint for the second stage of experiments and to see whether the predictive accuracy
of new models in the new two-dimensional design space would get worse when compared to
the accuracy of M1 models developed in the original space of 16 dimensions.

4.3 Energy output prediction 

The second stage of experiments used only the two input variables windGust2 and dewPoint, 
with all other symbolic regression settings identical to the first stage experiments. As a result, a 
new set of one and two-variable models was generated. We again applied a selection procedure 
Figure 7: Individual contributions of input variables in the selected set of models to the relative
training error.

Figure 8: Selected set M2 of 'best' models in up to two-dimensional input space and two
modeling objectives.



to the superset of models by selecting only 25% of robust models closest to the Pareto front
with a training error of at most $1 - R^2 = 0.30$ and a model complexity of at most 250. The
resulting set of 587 simplest models, denoted as M2, is depicted in Figures 8 and 9.

Figure [9] is obtained using the VariableContributionTable function of DataModeler, and it 
exposes the trade-offs for input spaces and prediction accuracy for energy prediction. 

We emphasize here that it is the decision and the responsibility of the domain expert
to pick an appropriate input space for the energy prediction models. This decision will be 
guided by the costs and risks associated with different prediction accuracies, and by the time 
needed to perform measurements of associated design spaces. The responsibility of a good 
model development tool is to empower experts with robust information about the trade-offs. 

At the last stage of model analysis we used the CreateModelEnsemble function of DataModeler
to select an ensemble of regression models from M2, allowing only models with model
complexities not exceeding 150. As can be seen in Figure 8, an increase in model complexity
does not provide a sufficient decrease in the training error. Since our goal is to predict energy
production on a completely new interval of weather conditions (here: July 2011), we settle for
the simplest models to avoid potential over-fitting.

The selected model ensemble consists of six models presented in Table 2. The values of
model complexity, training error, and test error⁴ for the six models in the ensemble are respectively

4 Test error is, of course, evaluated post facto, after the models are selected into the model ensemble. 
Figure 9: Variable combination table. Visualization of models in M2 niched per driving variable
combination. Note that windGust2 alone is insufficient to predict energy output with the accuracy
that is achieved when windGust2 and dewPoint are used. The model error is computed using training data.



Table 2: Model ensemble (six models) selected from M2. Constants are rounded to one digit
after the decimal point. Here $w$ denotes windGust2 and $d$ denotes dewPoint.

1. $-32.1 + 2.9\,(-\sqrt{w} + w)$

2. $112.0 - 3.5 \cdot 10^{-5}\,(-1956.3 + d^2 + w^2)^2$

3. $-6.4 + 1.3 \cdot 10^{-4}\,(9 - \sqrt{w})^2\, w^2\,(-9.9 + d + 2w)$

4. $-4.5 + 4.3 \cdot 10^{-4}\,(-8.9 + \sqrt{w})\,(-\sqrt{w} + 0.1\,w)\, w\,(-12 + d^2 + w^2)$

5. $-3.1 + 1.5 \cdot 10^{-4}\,\bigl(-3\,d\,w^2 + (9 - \sqrt{w})^2\, w^2\,(-16.3 + d + 2w)\bigr)$

6. $-11.2 + 9.4 \cdot 10^{-7}\,(9 - \sqrt{w})^2\,\sqrt{w}\,(39.4 + 4d + 7w)\,\bigl(§ + d + (10 + 2w)^2\bigr)$



(24, 0.299, 0.426), (42, 0.247, 0.472), (63, 0.209, 0.146), (78, 0.207, 0.149), (121, 0.205, 0.145),
(124, 0.211, 0.145).

The created model ensemble can now be evaluated on the test data. As mentioned in
Section 2.1, ensemble prediction is computed as a median of predictions of individual ensemble
members, while ensemble confidence is computed as a standard deviation of individual predictions.
We report that the normalized root mean squared error of ensemble prediction on the
test data is $\mathrm{RMSE}_{\mathrm{Test}} = 12.6\%$.
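For readers who want to compute a comparable figure on their own data, the following Python sketch shows one common way to normalize the root mean squared error. The paper does not state which normalization was used; dividing by the range of the observed values is an assumption made here, and the numbers in the example are hypothetical.

```python
import math

def normalized_rmse(predicted, observed):
    """RMSE divided by the range of the observed values (assumed normalization)."""
    n = len(observed)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    span = max(observed) - min(observed)
    if span == 0:
        raise ValueError("observed values are constant; the range is zero")
    return rmse / span

# Hypothetical energy outputs in MW.
observed = [10.0, 20.0, 40.0]
predicted = [10.0, 20.0, 30.0]
nrmse = normalized_rmse(predicted, observed)
```

Other normalizations, for example by the mean output or by the installed capacity of the farm, would give different percentages for the same predictions.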

Figure 10 presents the predicted versus observed energy output in July 2011, with whiskers
corresponding to ensemble confidence. Note that the confidence intervals for prediction are
very wide for many test samples. This is normal and should be expected when prediction is
evaluated well beyond the training data range. Figure 11 presents ensemble prediction versus
actual energy production over time in July 2011.



5 Conclusions 

In this study we showed that wind energy output can be predicted from publicly available
weather data with an accuracy of at best 80% $R^2$ on the training range and at best 85.5% on the
unseen test data. We identified the smallest space of input variables (windGust2 and dewPoint),
where reported accuracy can be achieved, and provided clear trade-offs of prediction accuracy 
for decreasing the input space to the windGust2 variable. We demonstrated that an off-the-shelf 
data modeling and variable selection tool can be used with mostly default settings to run the 
symbolic regression experiments as well as variable importance, variable contribution analysis, 
ensemble selection and validation. 

We are looking forward to discussing the results with domain experts and to checking the applicability
Figure 10: Ensemble prediction versus observed energy output in July (test data) of the
final model ensemble. Whiskers correspond to ensemble disagreement measured as a standard
deviation between predictions of individual ensemble members for any given input sample.

Figure 11: Ensemble prediction (red) versus actual energy output (blue) over time in July
(test data).
of the produced models in real life for short-term energy production prediction. We are glad that
the presented framework is so simple that it can be used by practically anybody for predicting
wind energy production on a smaller scale: for individual windmills on private farms or urban
buildings, or for small wind farms. For future work, we are planning to study further the possibilities
for longer-term wind energy forecasting.